perf: chunk long string byte escaping #809

Draft
He-Pin wants to merge 1 commit into databricks:master from He-Pin:split/pr776-byte-chunked-escape
He-Pin commented Apr 30, 2026

Motivation:

Split the JMH-positive, JDK17/JIT/GC-friendly long-string rendering piece out of #776. The original PR mixed renderer, stdlib, format, compareStrings, and Scala Native changes, and its Native hyperfine results were not clean enough to merge it as one large PR.

Key Design Decision:

Keep this PR focused on byte-rendering long strings that contain JSON escapes. This PR does not include compareStrings, char materializer, stdlib asciiSafe/substr/join, or format char-array assembly changes from #776.

Modification:

  • Add CharSWAR.findFirstEscapeChar(byte[], from, to) on JVM, Scala.js, and Scala Native.
  • In BaseByteRenderer, keep the existing UTF-8 byte array for long strings, locate escape bytes, bulk-copy clean chunks with System.arraycopy, and escape only the matching bytes inline.
  • Precompute the exact escaped output length before writing dirty long strings so ByteBuilder does not grow repeatedly.
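
The three steps above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: `ChunkedEscapeSketch`, `findFirstEscape`, and `renderJsonString` are hypothetical names, the scan is a plain byte loop rather than the SWAR scan in `CharSWAR`, a fixed-size output array stands in for `ByteBuilder`, and the choice of short escapes plus lowercase `\u00XX` hex is an assumption about the output format.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical sketch: scan UTF-8 bytes, bulk-copy clean chunks, escape only dirty bytes.
object ChunkedEscapeSketch {
  private val Hex: Array[Byte] = "0123456789abcdef".getBytes(UTF_8)

  // Bytes >= 0x80 are negative as signed bytes, so they never match here.
  private def needsEscape(b: Byte): Boolean =
    b == 0x22 || b == 0x5c || (b >= 0 && b < 0x20)

  // Index of the first byte in [from, to) that needs escaping, or -1 if none.
  def findFirstEscape(arr: Array[Byte], from: Int, to: Int): Int = {
    var i = from
    while (i < to) {
      if (needsEscape(arr(i))) return i
      i += 1
    }
    -1
  }

  // Character following the backslash for two-byte escapes, or -1 for \u00XX (assumed set).
  private def shortEscape(b: Byte): Int = b.toInt match {
    case 0x22 => '"'.toInt
    case 0x5c => '\\'.toInt
    case 0x08 => 'b'.toInt
    case 0x0c => 'f'.toInt
    case 0x0a => 'n'.toInt
    case 0x0d => 'r'.toInt
    case 0x09 => 't'.toInt
    case _    => -1
  }

  private def escapedWidth(b: Byte): Int =
    if (!needsEscape(b)) 1 else if (shortEscape(b) >= 0) 2 else 6

  def renderJsonString(s: String): Array[Byte] = {
    val src = s.getBytes(UTF_8)
    // Pass 1: exact output length (two quotes plus every byte's escaped width).
    var outLen = 2
    var i = 0
    while (i < src.length) { outLen += escapedWidth(src(i)); i += 1 }
    val out = new Array[Byte](outLen)
    var o = 0
    out(o) = '"'.toByte; o += 1
    // Pass 2: copy each clean chunk in bulk, then escape the single dirty byte after it.
    var from = 0
    while (from < src.length) {
      val hit = findFirstEscape(src, from, src.length)
      val end = if (hit < 0) src.length else hit
      System.arraycopy(src, from, out, o, end - from)
      o += end - from
      if (hit >= 0) {
        val b = src(hit)
        out(o) = '\\'.toByte
        val sc = shortEscape(b)
        if (sc >= 0) { out(o + 1) = sc.toByte; o += 2 }
        else {
          out(o + 1) = 'u'.toByte; out(o + 2) = '0'.toByte; out(o + 3) = '0'.toByte
          out(o + 4) = Hex((b >> 4) & 0xf); out(o + 5) = Hex(b & 0xf)
          o += 6
        }
      }
      from = end + 1
    }
    out(o) = '"'.toByte
    out
  }
}
```

Because the output length is known before the second pass, the destination never has to grow mid-write, which is the property the last bullet relies on.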

JDK17 / JIT / GC Notes:

  • Straight byte-array loops and System.arraycopy; no reflection or internal JDK APIs.
  • Reuses the existing UTF-8 byte-array allocation from master; no extra temporary arrays beyond the existing long-string encoding.
  • Clean long strings stay on the same bulk-copy fast path.
  • Dirty long strings avoid falling back to whole-string char escaping.
  • The JDK17 API that looks tempting here, HexFormat, is intentionally not used because per-control-char formatting would be a worse JIT/GC shape than the static hex table.

Focused Target JMH + GC:

JMH ran with the project compiled at the JDK17 level using the current Mill toolchain. Command shape:

./mill --no-server bench.runJmh sjsonnet.bench.RegressionBenchmark.main -p path=... -wi 3 -i 5 -r 3s -w 2s -f 1 -prof gc

| Benchmark | master ms/op | PR ms/op | Delta | master alloc B/op | PR alloc B/op | GC note |
| --- | --- | --- | --- | --- | --- | --- |
| large_string_template | 1.686 +/- 0.027 | 1.398 +/- 0.464 | -17.1% | 7,775,106 | 7,774,803 | allocation neutral/slightly lower |
| large_string_join | 0.637 +/- 0.075 | 0.646 +/- 0.025 | neutral | 1,530,343 | 1,530,269 | clean path neutral |

Full JMH + GC Sweep:

All 36 regression benchmark inputs were covered. The full sweep command used:

./mill --no-server bench.runJmh sjsonnet.bench.RegressionBenchmark.main -p path="$PATHS" -wi 3 -i 5 -r 2s -w 1s -f 1 -prof gc -rf json

bench.07 needs the same larger stack that bench.runRegressions normally provides, so it was rerun separately with -jvmArgsAppend -Xss100m. The full sweep is a screening run, not a claim that this renderer-only PR improves unrelated stdlib/parser cases; several unrelated rows had obvious system/JIT outliers. There were no clear time regressions by JMH error interval overlap.

| Benchmark | master ms/op | PR ms/op | Delta | Alloc delta |
| --- | --- | --- | --- | --- |
| assertions | 0.205 | 0.209 | +1.8% | +1.11% |
| bench.01 | 0.052 | 0.048 | -7.4% | +2.47% |
| bench.02 | 28.057 | 26.910 | -4.1% | -0.00% |
| bench.03 | 7.048 | 7.244 | +2.8% | +0.00% |
| bench.04 | 0.116 | 0.118 | +1.9% | -0.01% |
| bench.06 | 0.578 | 0.217 | -62.5% | +0.32% |
| bench.07 | 2.754 | 2.465 | -10.5% | -0.00% |
| bench.08 | 0.956 | 0.038 | -96.0% | -4.24% |
| bench.09 | 0.332 | 0.044 | -86.8% | +2.52% |
| gen_big_object | 2.697 | 0.803 | -70.2% | -0.06% |
| large_string_join | 0.588 | 0.584 | -0.8% | -0.04% |
| large_string_template | 1.631 | 1.260 | -22.8% | -0.01% |
| realistic1 | 1.447 | 1.610 | +11.3% | +0.00% |
| realistic2 | 43.015 | 42.317 | -1.6% | +0.00% |
| base64 | 0.156 | 0.151 | -3.4% | +0.00% |
| base64Decode | 0.125 | 0.118 | -5.4% | +0.00% |
| base64DecodeBytes | 5.348 | 5.228 | -2.2% | -0.02% |
| base64_byte_array | 0.851 | 0.775 | -8.9% | -0.00% |
| base64_stress | 0.192 | 0.177 | -8.0% | -0.01% |
| comparison | 0.070 | 0.033 | -53.3% | -0.12% |
| comparison2 | 143.021 | 44.546 | -68.9% | -0.02% |
| escapeStringJson | 0.798 | 0.057 | -92.8% | -1.96% |
| foldl | 0.091 | 0.101 | +10.6% | +0.28% |
| lstripChars | 0.122 | 0.114 | -6.4% | +0.01% |
| manifestJsonEx | 1.035 | 0.052 | -95.0% | -4.31% |
| manifestTomlEx | 1.128 | 0.068 | -94.0% | -1.92% |
| manifestYamlDoc | 0.057 | 0.056 | -1.1% | -0.92% |
| member | 0.660 | 0.639 | -3.1% | -0.00% |
| parseInt | 0.084 | 0.032 | -61.7% | -0.09% |
| reverse | 34.736 | 6.770 | -80.5% | -0.03% |
| rstripChars | 0.122 | 0.122 | -0.3% | -0.01% |
| stripChars | 0.129 | 0.115 | -10.9% | +0.01% |
| substr | 0.060 | 0.056 | -6.4% | +0.01% |
| setDiff | 0.418 | 0.392 | -6.1% | -0.07% |
| setInter | 0.358 | 0.350 | -2.4% | -0.13% |
| setUnion | 0.625 | 0.622 | -0.5% | +0.04% |
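
The "error interval overlap" screening mentioned before this table can be sketched as below. It is only an illustration of the criterion, not a script used for this PR: `IntervalOverlap`, `Score`, and `overlaps` are hypothetical names, and the example values are the `large_string_join` row from the focused JMH table, which reports mean +/- error.

```scala
// Hypothetical helper: two JMH results are treated as "not clearly different"
// when their [mean - err, mean + err] intervals overlap.
object IntervalOverlap {
  final case class Score(mean: Double, err: Double) {
    def low: Double = mean - err
    def high: Double = mean + err
  }

  def overlaps(a: Score, b: Score): Boolean =
    a.low <= b.high && b.low <= a.high

  def main(args: Array[String]): Unit = {
    val master = Score(0.637, 0.075) // large_string_join, master ms/op
    val pr     = Score(0.646, 0.025) // large_string_join, PR ms/op
    // The intervals overlap, so this row is screened as neutral.
    println(overlaps(master, pr))
  }
}
```

This is a coarse screen rather than a statistical test, which is why suspicious rows still get a longer focused rerun.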

Focused Rechecks:

Rows that looked suspicious in the full sweep were rerun with longer settings. The only stable allocation concern from the raw table was bench.09; a 3-fork rerun made it neutral/slightly lower. bench.06 was also rerun because the raw full sweep showed a small allocation delta.

| Benchmark | master ms/op | PR ms/op | Delta | Alloc delta |
| --- | --- | --- | --- | --- |
| bench.06 | 0.219 | 0.215 | -1.7% | +0.05% |
| bench.09 | 0.042 | 0.042 | -1.2% | -0.36% |

Scala Native Hyperfine:

Native artifacts were built with ./mill --no-server 'sjsonnet.native[3.3.7].nativeLink' on both master and this branch. Each hyperfine command loops 20 CLI invocations, uses --warmup 5 --runs 60, and the table divides the reported mean/median back to per-invocation milliseconds.

| Benchmark | master mean ms | PR mean ms | Delta | master median ms | PR median ms | Median delta |
| --- | --- | --- | --- | --- | --- | --- |
| large_string_template | 11.60 +/- 0.98 | 10.30 +/- 0.82 | -11.3% | 11.32 | 9.95 | -12.1% |
| large_string_join | 6.01 +/- 0.12 | 6.02 +/- 0.16 | neutral | 5.98 | 5.98 | neutral |

Correctness Review:

  • visitLongString is only used for String values when escapeUnicode = false, matching the existing ByteRenderer path.
  • Every byte of a multi-byte UTF-8 sequence, lead or continuation, is >= 0x80, so scanning the UTF-8 byte array for ", \, or bytes < 0x20 cannot falsely match inside a non-ASCII code point.
  • Control characters U+0000 through U+001F remain single-byte UTF-8 and are emitted as the same JSON escapes as the old RenderUtils.escapeByte fallback.
  • Buffer safety was rechecked: every chunk copy reads elemBuilder.arr after ensureLength, so a grow cannot leave a stale array reference.
  • JVM and Native CLI parity checks against master passed for long clean ASCII, long non-ASCII, quote/backslash, and control-character mixed strings.
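
The UTF-8 argument in the list above can be checked exhaustively with a short standalone program. This is a sketch for illustration and not part of the PR's test suite; `Utf8ScanSafety` and `nonAsciiNeverYieldsAsciiBytes` are hypothetical names. It encodes every non-ASCII Unicode scalar value and confirms that none of the resulting bytes falls in the ASCII range where `"`, `\`, and control bytes live.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical check: no byte of any multi-byte UTF-8 sequence is below 0x80.
object Utf8ScanSafety {
  def nonAsciiNeverYieldsAsciiBytes(): Boolean = {
    var cp = 0x80
    while (cp <= 0x10FFFF) {
      // Skip surrogates: they are not scalar values and have no UTF-8 encoding.
      if (cp < 0xD800 || cp > 0xDFFF) {
        val bytes = new String(Character.toChars(cp)).getBytes(UTF_8)
        var i = 0
        while (i < bytes.length) {
          // Bytes >= 0x80 are negative as signed JVM bytes.
          if (bytes(i) >= 0) return false
          i += 1
        }
      }
      cp += 1
    }
    true
  }

  def main(args: Array[String]): Unit =
    println(nonAsciiNeverYieldsAsciiBytes())
}
```

A byte-level scan for escape candidates therefore only ever stops on bytes that came from real ASCII characters in the source string.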

Rejected Splits From #776:

  • Format.scala char-array assembly: not JMH-positive on current master.
  • length/substr/asciiSafe/join group: substr regressed, so it should not be split out as-is.
  • std.join exact-capacity builder: allocation improved in one run, but no-prof JMH regressed.
  • compareStrings/SWAR group: too broad and not GC-proven for a focused first split.
  • JVM String.indexOf escape scan: tiny signal only, not enough for a separate PR.

Verification:

  • ./mill --no-server 'sjsonnet.jvm[3.3.7].compile'
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
  • ./mill --no-server 'sjsonnet.jvm[3.3.7].test'
  • ./mill --no-server 'sjsonnet.native[3.3.7].nativeLink'
  • Full JMH+GC sweep over all 36 regression benchmark inputs
  • Focused JMH+GC rechecks for suspicious full-sweep rows
  • Native hyperfine commands above
  • JVM/Native output parity checks against master for long string escape edge cases

References:

Motivation:
Split the JMH-positive long-string rendering piece out of databricks#776 without carrying over the broader Scala Native render-pipeline experiment.

Modification:
- Add CharSWAR.findFirstEscapeChar for byte arrays on JVM, JS, and Native.
- Keep the existing UTF-8 byte array for long strings, but locate escape bytes and copy clean chunks with System.arraycopy.
- Escape only the matching bytes inline.
- Precompute the exact escaped output length before writing dirty strings so ByteBuilder does not grow repeatedly.

Result:
This keeps the change JDK17/JIT/GC friendly: straight byte-array loops, no internal JDK APIs, no extra temporary arrays beyond the existing UTF-8 encoding, and no regression on clean long strings.